Jump to content

Recommended Posts

When the upgrade software processed the database (MySQL), it converted and parsed old legacy content from the old legacy database schema and made it compatible with the new software and new schema.

The legacy schema was different and not supported by the new software and new schema.

The new software attempted to reconstruct the posts (among other things -- really everything) into the new schema.

A side effect was that URLs to media (graphics - png, jpg, etc..) were converted as follows:

What the original content had prior to the upgrade:  image tag for a specific URL

<img src="URL"/>  

After the IPS software converted the post, an image tag for a specific URL

<span>URL</span>

Which is obviously wrong.   

The other thing the new software did without reason is convert things like this:

After the IPS software ran, it created an image tag for a specific URL wound up like this:

<a href="URL"><span>URL</span></a>

Same URL literal in the href and body of the span.  Which is also wrong.

The code I have developed parses all of the content replacing those mistakes of the site software as follows.  If it finds either case:

<span>URL</span>
<!-- or -->
<a href="URL"><span>URL</span></a>

becomes, as necessary:

<img src="URL"/>

I did this for all of the forums_posts, now I need to do it for all of the archive_forums_posts.

The other glitch that got in the way for all of these replacements is that the legacy content contained UTF-8 encoded characters that were rejected by the MySQL statements.. so in effect if an update statement was like:

update table_name
set field = 'New value X and something else'
where CLAUSE

where X contained a non-conforming character in the Unicode character set, the SQL Statement was truncated so that only the text up to (but not including "X and something else" was NOT inserted.  That's why some of the legacy posts seemed, at first, to be truncated.

I've fixed that defect (in my code) as well so that the updates can work more reliably.

What you should know:

Be on the look out for posts anywhere that contain bare literal LINKS to media files.  It's likely that the conversion from the IPS software failed to recognize the content.  My code can and will correct them.

Be patient, it's not easy to find all of the needles in the all haystacks.

Here's a good example, just looking at archived posts (not the normal timely posts, but the posts older than 5 years that are stored differently than fresher (less than 5 yr) posts:

After the parse-run, the directory contains about 185,000 .sql files generated (files made subject to the conditions above, representing an edit to a posting to correct the defects described)...  (one file per surgical update)..  I don't need to actually make one file per update, but I do for auditing purposes.   The whole update actually is one very long series of SQL statements.

$ du -BM -s .
893M 

In any case, 893 Mbytes worth of changes to content.  For just archived posts.   Many needles.  

 

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now
  • Recently Browsing   0 members

    • No registered users viewing this page.
×
×
  • Create New...

Important Information

By using this site, you agree to our Terms of Use.